skimr and visdat packagesThe Black Lives Matter protests over the last several years have made us aware of the racial aspects of policing.
Here we’re specifically interested in
These aren’t very precise, but that’s okay: Part of the goal of EDA is to clarify and refine our research questions
(Adapted from Tanweer et al. (2021), 14-15, and Liboiron (2021))
We’ll be using data on police stops in Oakland, California, collected and published by the Stanford Open Policing Project.
For reproducibility, we’ll write a bit of code that automatically downloads the data
To get the download URL:
README: https://github.com/stanford-policylab/opp/blob/master/data_readme.md.
The dataset is a zipped csv or comma-separated value file. CSVs are structured like Excel spreadsheets, but are stored in plain text rather than Excel’s format.
## Rows: 133407 Columns: 28
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): raw_row_number, location, beat, subject_race, subject_sex, officer_assignment, type, ...
## dbl (3): lat, lng, subject_age
## lgl (7): arrest_made, citation_issued, warning_issued, contraband_found, contraband_drugs, con...
## date (1): date
## time (1): time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Peng and Matsui (2016) use some base R functions to look at dimensions of the dataframe and column (variable) types. skimr is more powerful.
## ── Data Summary ────────────────────────
## Values
## Name dataf
## Number of rows 133407
## Number of columns 28
## _______________________
## Column type frequency:
## character 16
## Date 1
## difftime 1
## logical 7
## numeric 3
## ________________________
## Group variables None
##
## ── Variable type: character ────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max empty n_unique whitespace
## 1 raw_row_number 0 1 1 71 0 133407 0
## 2 location 51 1.00 1 78 0 60723 0
## 3 beat 72424 0.457 3 19 0 129 0
## 4 subject_race 0 1 5 22 0 5 0
## 5 subject_sex 90 0.999 4 6 0 2 0
## 6 officer_assignment 121431 0.0898 5 97 0 20 0
## 7 type 20066 0.850 9 10 0 2 0
## 8 outcome 34107 0.744 6 8 0 3 0
## 9 search_basis 92250 0.309 5 14 0 3 0
## 10 reason_for_stop 0 1 14 197 0 113 0
## 11 use_of_force_description 116734 0.125 10 10 0 1 0
## 12 raw_subject_sdrace 0 1 1 1 0 7 0
## 13 raw_subject_resultofencounter 0 1 7 213 0 315 0
## 14 raw_subject_searchconducted 0 1 2 24 0 34 0
## 15 raw_subject_typeofsearch 52186 0.609 2 112 0 417 0
## 16 raw_subject_resultofsearch 111633 0.163 5 95 0 298 0
##
## ── Variable type: Date ─────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max median n_unique
## 1 date 2 1.00 2013-04-01 2017-12-31 2015-07-19 1638
##
## ── Variable type: difftime ─────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max median n_unique
## 1 time 2 1.00 0 secs 86340 secs 16:12 1439
##
## ── Variable type: logical ──────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean count
## 1 arrest_made 0 1 0.121 FAL: 117217, TRU: 16190
## 2 citation_issued 0 1 0.394 FAL: 80836, TRU: 52571
## 3 warning_issued 0 1 0.231 FAL: 102545, TRU: 30862
## 4 contraband_found 92250 0.309 0.149 FAL: 35005, TRU: 6152
## 5 contraband_drugs 92250 0.309 0.0844 FAL: 37684, TRU: 3473
## 6 contraband_weapons 92250 0.309 0.0299 FAL: 39928, TRU: 1229
## 7 search_conducted 0 1 0.309 FAL: 92250, TRU: 41157
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 lat 114 0.999 37.8 0.0284 37.4 37.8 37.8 37.8 38.1 ▁▁▇▁▁
## 2 lng 114 0.999 -122. 0.0432 -122. -122. -122. -122. -119. ▇▁▁▁▁
## 3 subject_age 102724 0.230 33.2 13.3 10 23 29 41 97 ▇▆▃▁▁
raw_row_number has 1 unique value per row
subject_race and subject_sex have just 5 and 2 unique values
type, outcome, and search_basis
n_missing, low complete_rate)difftime
?difftime tells us that difftime is used to represent intervals or “time differences”search_conducted and contraband_foundsearch_conducted has no missing values, but contraband_found has a lot of missing valuessubject_race is 100% completesearch_conducted is also 100% completecontraband_found is only 31% completevisdat::vis_miss()to
contraband_found.But this raises a warning about large data
So we’ll use sample_n() to draw a subset
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## Please use `gather()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
Arguments in vis_miss() are useful for picking up patterns in missing values
## cluster = TRUE uses hierarchical clustering to order the rows
vis_miss(dataf_smol, cluster = TRUE) +
coord_flip()Several variables related to search outcomes are missing together
contraband_found,contraband_drugs,contraband_weapons,search_basis,use_of_force_description,raw_subject_typeofsearch, andraw_subject_resultofsearch
However, search_conducted is complete
When a search has been conducted, do we know whether contraband was found?
contraband_found is missing?With 28 columns, the dataframe is too wide to print in a readable way.
Instead we’ll use the base R function View() in an interactive session. This shows us an Excel-like spreadsheet presentation of a dataframe.
View() can cause significant problems if you use it with a large dataframe on a slower machine. So we’ll use a pipe: first extract the head() or tail() of the dataset, and then View() it. We’ll also go ahead and view dataf_smol, the subset we created for visdat above.
Some of my observations:
raw_row_number can’t be turned into a numeric valuelocation is a mix of addresses and intersections (“Bond St @ 48TH AVE”)
use_of_force_description doesn’t seem to be a descriptive text field; instead it seems to be mostly missing or “handcuffed”We can also use skimr to check data quality by looking at the minimum and maximum values. Do these ranges make sense for what we expect the variable to be?
## ── Data Summary ────────────────────────
## Values
## Name dataf
## Number of rows 133407
## Number of columns 28
## _______________________
## Column type frequency:
## character 16
## Date 1
## difftime 1
## logical 7
## numeric 3
## ________________________
## Group variables None
##
## ── Variable type: character ────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max empty n_unique whitespace
## 1 raw_row_number 0 1 1 71 0 133407 0
## 2 location 51 1.00 1 78 0 60723 0
## 3 beat 72424 0.457 3 19 0 129 0
## 4 subject_race 0 1 5 22 0 5 0
## 5 subject_sex 90 0.999 4 6 0 2 0
## 6 officer_assignment 121431 0.0898 5 97 0 20 0
## 7 type 20066 0.850 9 10 0 2 0
## 8 outcome 34107 0.744 6 8 0 3 0
## 9 search_basis 92250 0.309 5 14 0 3 0
## 10 reason_for_stop 0 1 14 197 0 113 0
## 11 use_of_force_description 116734 0.125 10 10 0 1 0
## 12 raw_subject_sdrace 0 1 1 1 0 7 0
## 13 raw_subject_resultofencounter 0 1 7 213 0 315 0
## 14 raw_subject_searchconducted 0 1 2 24 0 34 0
## 15 raw_subject_typeofsearch 52186 0.609 2 112 0 417 0
## 16 raw_subject_resultofsearch 111633 0.163 5 95 0 298 0
##
## ── Variable type: Date ─────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max median n_unique
## 1 date 2 1.00 2013-04-01 2017-12-31 2015-07-19 1638
##
## ── Variable type: difftime ─────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max median n_unique
## 1 time 2 1.00 0 secs 86340 secs 16:12 1439
##
## ── Variable type: logical ──────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean count
## 1 arrest_made 0 1 0.121 FAL: 117217, TRU: 16190
## 2 citation_issued 0 1 0.394 FAL: 80836, TRU: 52571
## 3 warning_issued 0 1 0.231 FAL: 102545, TRU: 30862
## 4 contraband_found 92250 0.309 0.149 FAL: 35005, TRU: 6152
## 5 contraband_drugs 92250 0.309 0.0844 FAL: 37684, TRU: 3473
## 6 contraband_weapons 92250 0.309 0.0299 FAL: 39928, TRU: 1229
## 7 search_conducted 0 1 0.309 FAL: 92250, TRU: 41157
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 lat 114 0.999 37.8 0.0284 37.4 37.8 37.8 37.8 38.1 ▁▁▇▁▁
## 2 lng 114 0.999 -122. 0.0432 -122. -122. -122. -122. -119. ▇▁▁▁▁
## 3 subject_age 102724 0.230 33.2 13.3 10 23 29 41 97 ▇▆▃▁▁
More observations:
filter(dataf, date >= '2014-01-01')p50) is 29; 50% of values are between 23 and 41filter(dataf, subject_age >= 18, subject_age < 65)Screenshot of the two summary tables from the Oakland report. Source: https://cao-94612.s3.amazonaws.com/documents/OPD-Racial-Impact-Report-2016-2018-Final-16Apr19.pdf, page 8
lubridate::year() does exactly thiscount## [1] 0.08190918
## [1] 0.03615819
## [1] 0.07065
Peng and Matsui note that plots are useful for both checking and setting expectations
ggplot() to get the yeargeom_bar() gives us counts## Warning: Removed 2 rows containing non-finite values (stat_count).
How about counts per year by race/ethnicity?
## Warning: Removed 2 rows containing non-finite values (stat_count).
Let’s switch from bars to points and lines and change up the color palette
dataf |>
mutate(year = year(date)) |>
ggplot(aes(year, color = subject_race)) +
geom_point(stat = 'count') +
geom_line(stat = 'count') +
scale_color_brewer(palette = 'Set1')## Warning: Removed 2 rows containing non-finite values (stat_count).
## Removed 2 rows containing non-finite values (stat_count).
plotly::ggplotly() creates an interactive version of a ggplot object
## Warning: Removed 2 rows containing non-finite values (stat_count).
## Removed 2 rows containing non-finite values (stat_count).
Let’s translate our natural-language research questions into statistical questions:
Whether Black people in Oakland might be more likely to be stopped than White people \[ \Pr(stopped | Black) \textrm{ vs } \Pr(stopped | White) \]
Whether Black people who are stopped might be more likely to have contraband \[ \Pr(contraband | stopped, searched, Black) \textrm{ vs } \Pr(contraband | stopped, searched, White) \]
\[ \frac{\Pr(stopped | Black)}{\Pr(stopped | White)} = \frac{\Pr(Black|stopped)}{\Pr(Black)} \frac{\Pr(White)}{\Pr(White|stopped)} \]
\[ \Pr(Black|stopped) \]
dataf |>
count(subject_race) |>
mutate(share = n / sum(n)) |>
arrange(desc(share)) |>
mutate(share = scales::percent(share, accuracy = 1))| r/e | residents | stops | ratio |
|---|---|---|---|
| Black | 24% | 59% | 2.5 |
| Hispanic | 27% | 20% | 0.7 |
| White | 27% | 12% | 0.4 |
| API | 16% | 6% | 0.4 |
## What fraction of stops had a search?
dataf |>
count(search_conducted) |>
mutate(share = n / sum(n))Across all subjects, 31% of stops involved a search.
ggplot(dataf, aes(subject_race, fill = search_conducted)) +
geom_bar(position = position_fill()) +
scale_fill_manual(values = c('transparent', 'red'))dataf |>
count(subject_race, search_conducted) |>
group_by(subject_race) |>
mutate(rate = n / sum(n)) |>
ungroup() |>
filter(search_conducted) |>
mutate(rate = scales::percent(rate, accuracy = 1))\[ \Pr(contraband | searched, stopped, Black) \] We want to filter() down to only stops where a search was conducted
dataf |>
filter(search_conducted) |>
ggplot(aes(subject_race, fill = contraband_found)) +
geom_bar(position = position_fill()) +
scale_fill_manual(values = c('transparent', 'blue')) +
ylim(0, .2)## Warning: Removed 5 rows containing missing values (geom_bar).
dataf |>
filter(search_conducted) |>
count(subject_race, contraband_found) |>
group_by(subject_race) |>
mutate(rate = n / sum(n)) |>
ungroup() |>
filter(contraband_found) |>
mutate(rate = scales::percent(rate, accuracy = 1))This preliminary analysis indicates that
What are some further directions we could take this analysis?
Suppose you’ve conducted this EDA because you’re working with an activist organization that promotes defunding the police and prison abolition. Should you share the preliminary findings above with your organization contacts?
What influence would the following factors make to your answer?
What other factors should be taken into account as you decide whether to share your findings? Or not taken into account?
How has this “raw data” been shaped by the journey of the data to get to us?
The lab related to this material is available at https://github.com/data-science-methods/lab-w06-eda.